Encoder in the Transformer


Gaurav
February 12, 2024

Encoder in the Transformer:

Encoder Structure

Each encoder in the Transformer consists of two main parts:

  1. Multi-head Self-Attention followed by Add & Normalize: This is the self-attention mechanism we discussed earlier, where every position in the input sequence attends to all positions in the same sequence. After the attention output is computed, the sub-layer's input is added back to it through a residual connection (the "add" operation), and layer normalization is then applied (see the short sketch after this list).

  2. Position-wise Feed-Forward Networks followed by Add & Normalize: This consists of two linear transformations with a ReLU activation in between. Just as with the self-attention sub-layer, a residual connection adds the input back to the feed-forward output, followed by layer normalization.

  • $X$ is the matrix after positional encoding.
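
Both sub-layers use the same "Add & Normalize" wrapper. As a minimal sketch (PyTorch is assumed here, and the function name is illustrative), the wrapper simply adds the sub-layer's input to the sub-layer's output and applies layer normalization:

```python
import torch
import torch.nn as nn

def add_and_normalize(x: torch.Tensor, sublayer_out: torch.Tensor,
                      norm: nn.LayerNorm) -> torch.Tensor:
    """Residual connection ("add") followed by layer normalization."""
    return norm(x + sublayer_out)
```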

Equations for the Encoder's Operations:

  1. Self-Attention:
     $$\text{Self-Attention}(X) = \text{MultiHead}(X, X, X)$$

  2. Add & Normalize after Self-Attention:
     $$\text{Output\_after\_attention} = \text{LayerNorm}(X + \text{Self-Attention}(X))$$

  3. Feed-Forward:

     Assuming the feed-forward network consists of two linear layers with weights $W_1$ and $W_2$, biases $b_1$ and $b_2$, and ReLU as the activation function, it can be represented as:

     $$\text{FFN}(X) = \text{ReLU}(X W_1 + b_1) W_2 + b_2$$

  4. Add & Normalize after Feed-Forward:
     $$\text{Output\_of\_encoder} = \text{LayerNorm}(\text{Output\_after\_attention} + \text{FFN}(\text{Output\_after\_attention}))$$

So, the output of the encoder, after processing the input matrix $X$ (with positional encodings), is $\text{Output\_of\_encoder}$. If there are multiple encoder layers in the Transformer, this output will serve as the input $X$ for the next encoder layer.
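
Putting the four equations together, the sketch below shows one encoder layer end to end. PyTorch is assumed (the article itself is framework-agnostic), the sizes `d_model = 512`, 8 heads, and `d_ff = 2048` are the defaults from the original Transformer paper, and the class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer, following the equations above
    (post-norm: LayerNorm is applied after each residual addition)."""

    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.linear1 = nn.Linear(d_model, d_ff)   # W1, b1
        self.linear2 = nn.Linear(d_ff, d_model)   # W2, b2
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-Attention(X) = MultiHead(X, X, X)
        attn_out, _ = self.self_attn(x, x, x)
        # Output_after_attention = LayerNorm(X + Self-Attention(X))
        y = self.norm1(x + attn_out)
        # FFN(Y) = ReLU(Y W1 + b1) W2 + b2
        ffn_out = self.linear2(torch.relu(self.linear1(y)))
        # Output_of_encoder = LayerNorm(Y + FFN(Y))
        return self.norm2(y + ffn_out)

# Usage: x is a batch of embedded, position-encoded sequences.
x = torch.randn(2, 10, 512)     # (batch, sequence length, d_model)
out = EncoderLayer()(x)         # same shape as x; would feed the next encoder layer
```

Note that `nn.MultiheadAttention` keeps the per-head projection matrices internally; the next section spells them out explicitly.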


Encoder Process with Parameters:


  1. Word Embeddings & Positional Encoding: $\text{Sentence} \rightarrow \text{Word Embeddings (Embedding matrix)} \rightarrow \text{Add Positional Encoding} \rightarrow X$

  2. Multi-Head Self-Attention:

    • For each head $i$: $X \xrightarrow{W_{Qi},\, W_{Ki},\, W_{Vi}} Q_i, K_i, V_i$
    • Self-attention for each head: $Q_i, K_i, V_i \rightarrow \text{Scaled Dot-Product Attention} \rightarrow Z_i$
    • Combine all heads: $\text{Concatenate } Z_i \text{ matrices} \xrightarrow{W_O} Z_{\text{combined}}$
  3. Add & Normalize after Self-Attention: $X + Z_{\text{combined}} \rightarrow \text{Layer Normalization (with learned parameters } \gamma \text{ and } \beta \text{)} \rightarrow Y$

  4. Position-wise Feed-Forward Network (FFN):

    • FFN parameters: $W_1, b_1$ for the first layer and $W_2, b_2$ for the second layer.

    $Y \xrightarrow{W_1,\, b_1} \text{Linear Transformation} \rightarrow \text{ReLU Activation} \xrightarrow{W_2,\, b_2} \text{Linear Transformation} \rightarrow F$

  5. Add & Normalize after FFN: $Y + F \rightarrow \text{Layer Normalization (with learned parameters } \gamma \text{ and } \beta \text{)} \rightarrow O$

Where:

  • $W_{Qi}$, $W_{Ki}$, and $W_{Vi}$ are the weight matrices for computing the Query, Key, and Value for the $i^{th}$ attention head.
  • $W_O$ is the weight matrix for combining the outputs of all attention heads.
  • $W_1, b_1$ are the weight matrix and bias for the first linear transformation in the FFN.
  • $W_2, b_2$ are the weight matrix and bias for the second linear transformation in the FFN.
  • $\gamma$ and $\beta$ are the learned scale and shift parameters for layer normalization.

The final output after one encoder layer is $O$. If there are more encoder layers, $O$ would serve as the input $X$ for the next encoder layer.
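
To make these parameters concrete, here is a sketch of step 2 with the per-head projections written out one by one, plus a look at where $\gamma$ and $\beta$ live in a layer-normalization module. PyTorch is again assumed, and the class name and sizes are illustrative rather than taken from the article:

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention with explicit per-head weights W_Qi, W_Ki, W_Vi
    and the output projection W_O."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.d_k = d_model // num_heads
        # One (W_Qi, W_Ki, W_Vi) triple per head i, each mapping d_model -> d_k.
        self.W_Q = nn.ModuleList([nn.Linear(d_model, self.d_k, bias=False) for _ in range(num_heads)])
        self.W_K = nn.ModuleList([nn.Linear(d_model, self.d_k, bias=False) for _ in range(num_heads)])
        self.W_V = nn.ModuleList([nn.Linear(d_model, self.d_k, bias=False) for _ in range(num_heads)])
        # W_O maps the concatenated head outputs back to d_model.
        self.W_O = nn.Linear(num_heads * self.d_k, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        heads = []
        for W_Q, W_K, W_V in zip(self.W_Q, self.W_K, self.W_V):
            Q, K, V = W_Q(x), W_K(x), W_V(x)                        # X -> Q_i, K_i, V_i
            scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)  # scaled dot-product scores
            Z_i = torch.softmax(scores, dim=-1) @ V                 # Z_i for this head
            heads.append(Z_i)
        return self.W_O(torch.cat(heads, dim=-1))                   # concatenate Z_i, apply W_O

# gamma and beta from steps 3 and 5 are the affine parameters of LayerNorm:
norm = nn.LayerNorm(512)
print(norm.weight.shape, norm.bias.shape)  # gamma and beta, each of shape (512,)
```

In practice the per-head projections are usually fused into single $d_{\text{model}} \times d_{\text{model}}$ matrices and computed in one batched operation; the loop over heads here only mirrors the per-head notation above.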

